
    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

    This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system's robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, thus remaining consistent with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting; however, in a speaker-independent setting the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.
    Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
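
    The abstract does not spell out the training setup, but the core idea, audio-derived voice-activity decisions acting as pseudo-labels for a visual speaking/not-speaking classifier, can be sketched roughly as follows. The network, tensor shapes, and the source of the labels are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: audio-based voice activity detection provides the
# 0/1 targets, so no manual annotation is needed for the visual classifier.
import torch
import torch.nn as nn

class FaceSpeakingClassifier(nn.Module):
    """Tiny CNN over per-frame face crops (architecture is illustrative)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 1)  # single logit for "speaking"

    def forward(self, x):
        return self.head(self.features(x)).squeeze(-1)

model = FaceSpeakingClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(face_crops, vad_labels):
    """face_crops: (B, 3, H, W) face images; vad_labels: (B,) 0/1 pseudo-labels
    obtained from the audio channel of the same recording."""
    optimizer.zero_grad()
    loss = loss_fn(model(face_crops), vad_labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```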

    Trainable Articulatory Control Models for Visual Speech Synthesis

    Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

    This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach for achieving human-like, semantically meaningful co-speech gestures in agents.
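
    The abstract does not define the CSMP objective, but one common way to learn such a joint speech-gesture embedding is a symmetric, CLIP-style contrastive loss over paired clips; the resulting embeddings would then serve as the conditioning signal for the diffusion model. The sketch below is a minimal illustration under those assumptions, not the authors' exact formulation.

```python
# Hypothetical contrastive objective between paired speech and motion
# embeddings; encoder details, dimensions and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_speech_motion_loss(speech_emb, motion_emb, temperature=0.07):
    """speech_emb, motion_emb: (B, D) embeddings of time-aligned clips.
    Matched pairs share a batch index; all other pairs act as negatives."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = speech_emb @ motion_emb.t() / temperature    # (B, B) similarities
    targets = torch.arange(speech_emb.size(0), device=logits.device)
    loss_s2m = F.cross_entropy(logits, targets)           # speech -> motion
    loss_m2s = F.cross_entropy(logits.t(), targets)       # motion -> speech
    return 0.5 * (loss_s2m + loss_m2s)
```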

    Reverse Engineering Psychologically Valid Facial Expressions of Emotion into Social Robots

    Social robots are now part of human society, destined for schools, hospitals, and homes to perform a variety of tasks. To engage their human users, social robots must be equipped with the essential social skill of facial expression communication. Yet, even state-of-the-art social robots are limited in this ability because they often rely on a restricted set of facial expressions derived from theory with well-known limitations such as lacking naturalistic dynamics. With no agreed methodology to objectively engineer a broader variance of more psychologically impactful facial expressions into the social robots' repertoire, human-robot interactions remain restricted. Here, we address this generic challenge with new methodologies that can reverse-engineer dynamic facial expressions into a social robot head. Our data-driven, user-centered approach, which combines human perception with psychophysical methods, produced highly recognizable and human-like dynamic facial expressions of the six classic emotions that generally outperformed state-of-the-art social robot facial expressions. Our data demonstrate the feasibility of our method applied to social robotics and highlight the benefits of using a data-driven approach that puts human users at the centre of deriving facial expressions for social robots. We also discuss future work to reverse-engineer a wider range of socially relevant facial expressions, including conversational messages (e.g., interest, confusion) and personality traits (e.g., trustworthiness, attractiveness). Together, our results highlight the key role that psychology must continue to play in the design of social robots.
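
    One standard psychophysical route to this kind of reverse engineering is reverse correlation: present observers with random combinations of facial action units (AUs), record their emotion judgements, and estimate which AUs are diagnostic of each emotion. The toy sketch below illustrates that logic with a simulated observer; the AU set, trial count, and analysis are assumptions for illustration, not the study's actual stimuli or data.

```python
# Toy reverse-correlation sketch with a simulated observer; the numbers and
# the "happy" rule (AU6 + AU12) are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_aus = 2000, 42                               # assumed sizes
stimuli = rng.integers(0, 2, size=(n_trials, n_aus))     # random AU on/off

def simulated_observer(stim):
    # Stand-in for a human judgement: "happy" when AU6 and AU12 co-occur.
    return int(stim[6] and stim[12])

responses = np.array([simulated_observer(s) for s in stimuli])

# Classification-image style estimate: AUs more frequent on "happy" trials
# than on other trials are the diagnostic ones.
diagnostic = stimuli[responses == 1].mean(0) - stimuli[responses == 0].mean(0)
print("Most diagnostic AU indices:", np.argsort(diagnostic)[-3:][::-1])
```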

    User Evaluation of the SYNFACE Talking Head Telephone

    The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in both lab and home environments. Synface was found to support the users, especially in perceiving numbers and addresses, and to offer an enjoyable way to communicate. A majority deemed Synface to be a useful product.

    Acupuncture fails to reduce but increases anaesthetic gas required to prevent movement in response to surgical incision.

    Background: Acupuncture is used for clinical pain relief but has not been evaluated under clinical anaesthesia. This study was designed to compare movement in response to surgical incision in anaesthetized patients subjected to electro-acupuncture (EA) or sham procedures. Our hypothesis was that EA stimulation would reduce the requirements for anaesthetic gas. Methods: Forty-six healthy women, scheduled for laparoscopic sterilization at a Swedish county hospital, were randomized to receive either the electro-acupuncture (n = 23) or the sham (n = 23) procedure between the induction of general anaesthesia and the start of surgery. The minimal alveolar concentration (MAC) of sevoflurane required to prevent neck or major limb movements in response to surgical incision was determined in each group of patients. Results: The MAC of sevoflurane was found to be higher in the group given acupuncture than in the control group (2.1 ± 0.3% vs. 1.8 ± 0.4%; P = 0.008). Conclusion: Electro-acupuncture given during general anaesthesia with sevoflurane failed to reduce, and instead increased, the clinical need for anaesthetic gas, possibly by reducing the anaesthetic effect of sevoflurane and/or by facilitating nociceptive transmission and/or reflex activity.

    Matcha-TTS: A fast TTS architecture with conditional flow matching

    We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.
    Comment: 5 pages, 3 figures. Submitted to ICASSP 2024
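
    The OT-CFM objective named in the abstract has a compact form: sample a time t and a noise sample x0, move along the straight-line path towards the data x1, and regress the decoder's predicted vector field onto the corresponding target velocity. The sketch below follows that recipe; the decoder signature and conditioning inputs are placeholders, not Matcha-TTS's actual interfaces. At synthesis time, an ODE solver integrates the learned field from noise to a spectrogram in a small number of steps.

```python
# Hypothetical OT-CFM training loss; `decoder` predicts a vector field
# v(x_t, t | cond) and its call signature here is an assumption.
import torch

def ot_cfm_loss(decoder, x1, cond, sigma_min=1e-4):
    """x1: (B, T, C) target acoustic frames; cond: text-derived conditioning."""
    b = x1.size(0)
    t = torch.rand(b, 1, 1, device=x1.device)             # t ~ U(0, 1)
    x0 = torch.randn_like(x1)                              # noise sample
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1          # OT probability path
    u_t = x1 - (1 - sigma_min) * x0                        # target velocity
    v_pred = decoder(x_t, t.view(b), cond)                 # predicted field
    return torch.mean((v_pred - u_t) ** 2)
```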

    OverFlow: Putting flows on top of neural transducers for better TTS

    Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.
    Comment: 5 pages, 2 figures. Accepted for publication at Interspeech 2023
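
    The "flows on top of neural transducers" idea rests on the change-of-variables formula: an invertible flow maps acoustic frames into a latent space where the neural HMM's emission model applies, and adding the log-determinant of the Jacobian keeps the likelihood exact. A minimal sketch, assuming flow and HMM modules with the interfaces described in the comments:

```python
# Hypothetical sketch of exact maximum-likelihood training for a neural HMM
# with an invertible flow on its observations; module interfaces are assumed.
import torch

def flow_hmm_log_likelihood(flow, neural_hmm, frames, text_cond):
    """frames: (B, T, C) acoustic frames; text_cond: encoded text/phones.
    flow(frames) returns (latents, log_det_jacobian summed over T and C);
    neural_hmm.log_prob runs the forward algorithm over the latents."""
    z, log_det = flow(frames)                      # invertible transform
    log_p_z = neural_hmm.log_prob(z, text_cond)    # exact HMM likelihood
    return log_p_z + log_det                       # exact log p(frames | text)

# Training minimises the negative log-likelihood:
# loss = -flow_hmm_log_likelihood(flow, neural_hmm, frames, text_cond).mean()
```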